Last updated: 2019-06-17

Checks: 6 1

Knit directory: ~/Research-Local/RNAseq-Local/TCGA-Nigerian-RNAseq/

This reproducible R Markdown analysis was created with workflowr (version 1.4.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

The command set.seed(20190113) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

The following chunks had caches available:
  • DE test: Nigerian basal vs TCGA white basal and TCGA black basal for RUV
  • DE test: Nigerian basal vs TCGA white basal
  • DE test: TCGA black and TCGA white for RUV
  • DE test: TCGA black and TCGA white
  • Differential expression analysis -> batch effect comparison

To ensure reproducibility of the results, delete the cache directory NigerianTCGArawcountsDeSeq2-pc2_cache and re-run the analysis. To have workflowr automatically delete the cache directory prior to building the file, set delete_cache = TRUE when running wflow_build() or wflow_publish().

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility. The version displayed above was the version of the Git repository at the time these results were generated.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store
    Ignored:    .Rproj.user/
    Ignored:    analysis/.DS_Store
    Ignored:    code/.DS_Store
    Ignored:    docs/.DS_Store
    Ignored:    docs/figure/.DS_Store
    Ignored:    plots/.DS_Store

Untracked files:
    Untracked:  analysis/batcheffect/

Unstaged changes:
    Modified:   .Rhistory
    Deleted:    analysis/NigerianTCGArawcountsDeSeq2-pc2.Rmd
    Modified:   analysis/_site.yml
    Modified:   analysis/about.Rmd
    Modified:   analysis/index.Rmd

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


There are no past versions. Publish this analysis with wflow_publish() to start tracking its development.


#Translation from HTSeq raw counts -> DESeq2 table of raw counts I have 86 TCGA patients with whole-genome sequencing data and RNAseq data as well as 99 Nigerian patients with RNA-seq data. Raw counts were initially processed using HTSeq, so HTSeq raw counts (with one file per patient) are being formatted for use with DESeq2.

FOLDER <- "/Users/parajago/Research-Local/RNAseq-Local/Inputs/NigerianTCGA_quants-proteincoding"
sampleFiles <- grep("htseq.counts",list.files(FOLDER),value=TRUE)

#Differential gene expression setup based on race (b/w/other)
sampleConditionrace <- sampleFiles
countVar2=1
for (sample in sampleConditionrace){
  if (stri_detect_fixed(sample,"LIB")==TRUE){
    sampleConditionrace[countVar2] <- "Nigerian"
    countVar2=countVar2+1
  } else if (stri_detect_fixed(sample,"black")==TRUE){
    sampleConditionrace[countVar2] <- "TCGA_black"
    countVar2=countVar2+1
  } else if (stri_detect_fixed(sample,"white")==TRUE){
    sampleConditionrace[countVar2] <- "TCGA_white"
    countVar2=countVar2+1
  } else{
    sampleConditionrace[countVar2] <- "TCGA_other"
    countVar2=countVar2+1
  }
}

#Condition based on PAM50 subtype 
sampleConditionPAM50 <- sampleFiles
countVar2=1
for (sample in sampleConditionPAM50){
  if (stri_detect_fixed(sample,"Her2")==TRUE){
    sampleConditionPAM50[countVar2] <- "Her2"
    countVar2=countVar2+1
  } else if (stri_detect_fixed(sample,"Basal")==TRUE){
    sampleConditionPAM50[countVar2] <- "Basal"
    countVar2=countVar2+1
  } else if (stri_detect_fixed(sample,"LumA")==TRUE){
    sampleConditionPAM50[countVar2] <- "LumA"
    countVar2=countVar2+1
  } else if (stri_detect_fixed(sample,"LumB")==TRUE){
    sampleConditionPAM50[countVar2] <- "LumB"
    countVar2=countVar2+1
  } else if (stri_detect_fixed(sample,"PAMNL")==TRUE){
    sampleConditionPAM50[countVar2] <- "Normal"
    countVar2=countVar2+1
  } else{
    sampleConditionPAM50[countVar2] <- "PAM_other"
    countVar2=countVar2+1
  }
}

#Condition based on batch (relevant to the Nigerian patients only; no difference in batch for the TCGA patients)
batchval <- sampleFiles
countVar2=1
for (sample in batchval){
  if (stri_detect_fixed(sample,"batch1")==TRUE){
    batchval[countVar2] <- "batch1"
    countVar2=countVar2+1
  } else if (stri_detect_fixed(sample,"batch23")==TRUE){
    batchval[countVar2] <- "batch23"
    countVar2=countVar2+1
  } else if (stri_detect_fixed(sample,"batch4")==TRUE){
    batchval[countVar2] <- "batch4"
    countVar2=countVar2+1
  } else if (stri_detect_fixed(sample,"batch5")==TRUE){
    batchval[countVar2] <- "batch5"
    countVar2=countVar2+1
  } else{
    batchval[countVar2] <- "batchT"
    countVar2=countVar2+1
  }
}

table(sampleConditionrace, sampleConditionPAM50)
                   sampleConditionPAM50
sampleConditionrace Basal Her2 LumA LumB Normal PAM_other
         Nigerian      32   26   16   18      7         0
         TCGA_black    25    0    4    4      0         0
         TCGA_other     0    0    0    0      0        14
         TCGA_white    17    5    8    9      0         0
sampleTable2 <- data.frame(sampleName=gsub(".htseq.counts","",sampleFiles),
                          fileName=sampleFiles,
                          condition1=sampleConditionrace,
                          condition2=sampleConditionPAM50,
                          batch=batchval)

sampleTable2$sampleCondition <- paste(sampleTable2$condition1, sampleTable2$condition2, sep=".")

ddsHTSeqMF <- DESeqDataSetFromHTSeqCount(sampleTable=sampleTable2,
                                       directory=FOLDER,
                                       design=~sampleCondition)

ddsHTSeqMF <- ddsHTSeqMF[rowSums(counts(ddsHTSeqMF)) > 0, ] #Pre-filtering the dataset by removing the rows without information about gene expression

ddsHTSeqMF$sampleCondition <- relevel(ddsHTSeqMF$sampleCondition, ref = "Nigerian.Basal") #explicitly making the Nigerian Basal patients the reference population

#Initial exploratory analysis These are initial visualizations of our data sets to assess for batch effect between the Nigerian and TCGA sample sets.

dds <- estimateSizeFactors(ddsHTSeqMF) #The size factor is the median ratio of the sample over a "pseudosample": for each gene, the geometric mean of all samples. This accounts for sequencing depth. 

vsd <- vst(ddsHTSeqMF, blind=FALSE) #Variance-stabilizing transformation, only using this since >50 samples

df <- bind_rows( #Graphical representation of the variance-stabilizing transformation
          as_data_frame(log2(counts(dds, normalized=TRUE)[, 1:2]+1)) %>%
          mutate(transformation = "log2(x + 1)"),
          as_data_frame(assay(vsd)[, 1:2]) %>% mutate(transformation = "vst"))
          colnames(df)[1:2] <- c("x", "y")

ggplot(df, aes(x = x, y = y)) + geom_hex(bins = 80) + coord_fixed() + facet_grid( . ~ transformation)

Figure 1: Graphical representation of the variance-stabilizing transformation relative to the standard transformation of log2(x+1). This reduces variance purely due to low read counts. VST is effectively employed in DESeq2. This transformation will be the input for subsequent visualizations (ONLY VISUALIZATIONS) in this section.

meanSdPlot(log2(counts(estimateSizeFactors(ddsHTSeqMF),normalized=TRUE)[rowSums(counts(ddsHTSeqMF)) > 0,] + 1))

meanSdPlot(assay(vsd[rowSums(counts(ddsHTSeqMF)) > 0,]))

Figure 2: Plot of mean to variance to confirm that variance shrinking does work. This is another visualization of the approach noted in Figure 1.

#PCAs at this point of the analysis are the VST transformed counts, using the top 500 variance genes
plotPCA(vsd, intgroup=c("condition1", "batch"), ntop=500)

PCA 1-1:PCA demonstrating the raw difference between Nigerian and TCGA patients with within-batch demonstration. Nigerian patients do not appear to have significant differneces based on their batches. (TCGA patients are known not to have significant difference based on batch performed.)

There is a factor separating the TCGA patients into 2 groups that is not wholly ancestry dependent.

Of note, the PCA as performed by only has 33% variance explained with two variables when transformation is performed.

plotPCA(vsd, intgroup=c("condition2"), ntop=1000)

PCA 1-2:PCA demonstrating association with breast cancer subtype and gene expression. This is noted in both groups and consistent with known biology. Again, only 32% variance is explained with 2 principal components.

plotPCA(vsd, intgroup=c("condition1", "condition2"), ntop=1000)

PCA 1-3: Integrated PCA demonstrating combined effect of breast cancer subtype and race. Subtype appears to account for the clustering seen in the TCGA group. Again, only 32% variance is explained with 2 principal components (with top 1000 genes).

#Heatmap to assess clusters pre-analysis
topVarGenes <- head(order(rowVars(assay(vsd)), decreasing = TRUE), 250)
hm  <- assay(vsd)[ topVarGenes, ]
hm  <- hm - rowMeans(hm)
anno <- as.data.frame(colData(vsd)[, c("condition1", "condition2")])
colnames(anno) <- c("Ethnicity", "PAM50-subtype")
rownames(anno) <- colnames(vsd)
pheatmap(hm, annotation_col = anno, cluster_rows=TRUE, cluster_cols=TRUE, clustering_distance_rows = "correlation", clustering_distance_cols = "correlation", clustering_method = "complete", show_rownames=FALSE, show_colnames = FALSE, main="Heat map: Relative VST-transformed\ncounts across samples without batch correction")

Heatmap 1-1: This heatmap is only using the top 250 genes and is competely heirarchically clustered. Results from this are consistent with the PCA.

#Specific issues associated with our batches -We do not know the extent to which general RNA expression varies within the Nigerian population compared to within TCGA populations because of ethnicity / common ancestral variants -Biological differences are potentially confounded by batch effect (as Nigerians are a separate batch from TCGA); although no evidence of within-batch effect

We will discuss the following methods to manage the batch effect and our reasoning for our chosen approach. (1) sva (1 principal component) (2) RUVseq of housekeeping genes (k=1, with 2 and 6 housekeeping genes) (3) Quantile normalization (4) Limma removeBatchEffect

#Method 1: sva to account for batch effect related to the Nigerian/TCGA samples We are using SVA to create an unsupervised hypothetical “batch effect” variable that is then used to account for testing differences between the two groups.

dat  <- counts(dds, normalized = TRUE)
idx  <- rowMeans(dat) > 1
dat  <- dat[idx, ]
mod  <- model.matrix(~condition1, colData(dds))
mod0 <- model.matrix(~   1, colData(dds))
svseq <- svaseq(dat, mod, mod0, n.sv = 1)
Number of significant surrogate variables is:  1 
Iteration (out of 5 ):1  2  3  4  5  
sampleTablebe <- data.frame(sampleName=gsub(".htseq.counts","",sampleFiles),
                          fileName=sampleFiles,
                          condition1=sampleConditionrace,
                          condition2=sampleConditionPAM50,
                          be1=svseq$sv[,1],
                          batch=batchval)

sampleTablebe$sampleCondition <- paste(sampleTablebe$condition1, sampleTablebe$condition2, sep=".")

ddsHTSeqMFbe <- DESeqDataSetFromHTSeqCount(sampleTable=sampleTablebe,
                                       directory=FOLDER,
                                       design= ~be1+sampleCondition)

ddsHTSeqMFbe <- ddsHTSeqMFbe[rowSums(counts(ddsHTSeqMFbe)) > 0, ]

ddsHTSeqMFbe$sampleCondition <- relevel(ddsHTSeqMFbe$sampleCondition, ref = "Nigerian.Basal") #explicitly making the Nigerian Basal patients the reference population

Use of sva to create an unsupervised “batch effect” variable that effectively has full interaction with the Nigerian vs. TCGA variable does cause issues – performing differential expression analysis under these conditions does lead to not converging in beta.

I am performing an intial differential expression analysis with a batch-uncorrected version of the counts as well as the version that incorporates SVA-based batch correction.

ddsMF <- DESeq(ddsHTSeqMF) #This will refer to a version that does not include batch correction outside of what is accomplished through DESeq2 alone. 

ddsMFbe <-DESeq(ddsHTSeqMFbe) 
#4 rows did not converge in beta, labelled in mcols(object)$betaConv

resultsNames(ddsMF)
 [1] "Intercept"                                             
 [2] "sampleCondition_Nigerian.Her2_vs_Nigerian.Basal"       
 [3] "sampleCondition_Nigerian.LumA_vs_Nigerian.Basal"       
 [4] "sampleCondition_Nigerian.LumB_vs_Nigerian.Basal"       
 [5] "sampleCondition_Nigerian.Normal_vs_Nigerian.Basal"     
 [6] "sampleCondition_TCGA_black.Basal_vs_Nigerian.Basal"    
 [7] "sampleCondition_TCGA_black.LumA_vs_Nigerian.Basal"     
 [8] "sampleCondition_TCGA_black.LumB_vs_Nigerian.Basal"     
 [9] "sampleCondition_TCGA_other.PAM_other_vs_Nigerian.Basal"
[10] "sampleCondition_TCGA_white.Basal_vs_Nigerian.Basal"    
[11] "sampleCondition_TCGA_white.Her2_vs_Nigerian.Basal"     
[12] "sampleCondition_TCGA_white.LumA_vs_Nigerian.Basal"     
[13] "sampleCondition_TCGA_white.LumB_vs_Nigerian.Basal"     
resultsNames(ddsMFbe)
 [1] "Intercept"                                             
 [2] "be1"                                                   
 [3] "sampleCondition_Nigerian.Her2_vs_Nigerian.Basal"       
 [4] "sampleCondition_Nigerian.LumA_vs_Nigerian.Basal"       
 [5] "sampleCondition_Nigerian.LumB_vs_Nigerian.Basal"       
 [6] "sampleCondition_Nigerian.Normal_vs_Nigerian.Basal"     
 [7] "sampleCondition_TCGA_black.Basal_vs_Nigerian.Basal"    
 [8] "sampleCondition_TCGA_black.LumA_vs_Nigerian.Basal"     
 [9] "sampleCondition_TCGA_black.LumB_vs_Nigerian.Basal"     
[10] "sampleCondition_TCGA_other.PAM_other_vs_Nigerian.Basal"
[11] "sampleCondition_TCGA_white.Basal_vs_Nigerian.Basal"    
[12] "sampleCondition_TCGA_white.Her2_vs_Nigerian.Basal"     
[13] "sampleCondition_TCGA_white.LumA_vs_Nigerian.Basal"     
[14] "sampleCondition_TCGA_white.LumB_vs_Nigerian.Basal"     
fc = 0.58 #Subsequent threshold of signifcance for log2 fold change -> 0.58 = log2(1.5)
fdr = 0.05 #Subsequent threshold of significance for p-value (adjusted by FDR)

Warning: The above code chunk cached its results, but it won’t be re-run if previous chunks it depends on are updated. If you need to use caching, it is highly recommended to also set knitr::opts_chunk$set(autodep = TRUE) at the top of the file (in a chunk that is not cached). Alternatively, you can customize the option dependson for each individual chunk that is cached. Using either autodep or dependson will remove this warning. See the knitr cache options for more details.

##sva -> differential expression analysis overall results

cat("MA Plot 1: Original differential expression")
MA Plot 1: Original differential expression
res <- lfcShrink(ddsMF, coef="sampleCondition_TCGA_white.Basal_vs_Nigerian.Basal", type="ashr", optmethod = "mixEM")
DESeq2::plotMA(res, ylim=c(-10,10), xlim=c(0.1,200))

cat("MA Plot 2: Batch-corrected differential expression")
MA Plot 2: Batch-corrected differential expression
res2 <- lfcShrink(ddsMFbe, coef="sampleCondition_TCGA_white.Basal_vs_Nigerian.Basal", type="ashr" , optmethod = "mixEM")
DESeq2::plotMA(res2, ylim=c(-10,10), xlim=c(0.1,200))

An MAPlot demonstrates log2 fold changes attributable to the condition over the mean of normalized counts for all the samples in the DESeqDataSet. Points will be colored red if the adjusted p value is less than 0.1. Points which fall out of the window are plotted as open triangles pointing either up or down.

We see an overrepresentation of significant negative log fold changes of expression in the Nigerian patients relative to the TCGA white patients, but this is exacerbated after the sva correction.

##sva -> Inter-TCGA test

#Without batch correction
diffTCGA<- results(ddsMF, contrast=c("sampleCondition", "TCGA_black.Basal", "TCGA_white.Basal"), alpha=0.05)
nrow(diffTCGA)
[1] 19745
#diffTCGA$foldChange <- NA
#row.pos <- which(! is.na(diffTCGA$log2FoldChange) & 
#                diffTCGA$log2FoldChange >= 0)
#row.neg <- which(! is.na(diffTCGA$log2FoldChange) & 
#                diffTCGA$log2FoldChange < 0)
#diffTCGA$foldChange[row.pos] <- 2^diffTCGA$log2FoldChange[row.pos]
#diffTCGA$foldChange[row.neg] <- -2^((-1) * diffTCGA$log2FoldChange[row.neg])
#diffTCGA <- data.frame(id = row.names(diffTCGA), diffTCGA)

#before <- nrow(diffTCGA)
#diffTCGA <- diffTCGA[!is.na(diffTCGA$foldChange) & ! is.na(diffTCGA$padj),];
#after <- nrow(diffTCGA)
#print(paste0('Genes removed = ', (before - after), 
#             ' (fold change is NA)'))

diffTCGA <- diffTCGA[(diffTCGA$log2FoldChange >= fc | diffTCGA$log2FoldChange <= -fc),] 
diffTCGA <- subset(diffTCGA, padj < fdr)
nrow(diffTCGA)
[1] 1290
restemp <- lfcShrink(ddsMF, contrast=c("sampleCondition", "TCGA_black.Basal", "TCGA_white.Basal"), res = diffTCGA, type="ashr", optmethod = "mixEM")

restemp$temp <- row.names(restemp)
restemp$temp <- gsub("[.].+", "", restemp$temp)

restemp$symbol <- mapIds(EnsDb.Hsapiens.v75,
                     keys=restemp$temp,
                     column="SYMBOL",
                     keytype="GENEID",           
                     multiVals="first")

hist(restemp$pvalue, main="Histogram of unadjusted p-values of differential\n gene expression between basal breast cancers \nin TCGA black and \nTCGA white breast cancer patients")

with(restemp, plot(log2FoldChange, -log10(padj), pch=20, main="Volcano plot of differential gene expression between basal \nbreast cancers in TCGA black and \nTCGA white breast cancer patients", xlim=c(-50,50), ylim=c(0,70)))
with(subset(restemp, padj<0.01 & (2^(abs(log2FoldChange))>50)), points(log2FoldChange, -log10(padj), pch=20, col="blue"))
with(subset(restemp, padj<0.01 & (2^(abs(log2FoldChange))>50)), textxy(log2FoldChange, -log10(padj), labs=symbol, cex=.5))

#With batch correction
diffTCGA2<- results(ddsMFbe, contrast=c("sampleCondition", "TCGA_black.Basal", "TCGA_white.Basal"), alpha=0.05)
nrow(diffTCGA2)
[1] 19745
#diffTCGA2$foldChange <- NA
#row.pos <- which(! is.na(diffTCGA2$log2FoldChange) & 
#                diffTCGA2$log2FoldChange >= 0)
#row.neg <- which(! is.na(diffTCGA2$log2FoldChange) & 
#                diffTCGA2$log2FoldChange < 0)
#diffTCGA2$foldChange[row.pos] <- 2^diffTCGA2$log2FoldChange[row.pos]
#diffTCGA2$foldChange[row.neg] <- -2^((-1) * diffTCGA2$log2FoldChange[row.neg])
#diffTCGA2 <- data.frame(id = row.names(diffTCGA2), diffTCGA2)

#before <- nrow(diffTCGA2)
#diffTCGA2 <- diffTCGA2[!is.na(diffTCGA2$foldChange) & ! is.na(diffTCGA2$padj),];
#after <- nrow(diffTCGA2)
#print(paste0('Genes removed = ', (before - after), 
#             ' (fold change is NA)'))

#diffTCGA2 <- diffTCGA2[(diffTCGA2$foldChange >= fc | diffTCGA2$foldChange <= -fc) & 
#              diffTCGA2$padj < fdr,]

diffTCGA2 <- diffTCGA2[(diffTCGA2$log2FoldChange >= fc | diffTCGA2$log2FoldChange <= -fc),] 
diffTCGA2 <- subset(diffTCGA2, padj < fdr)
nrow(diffTCGA2)
[1] 480
restemp2 <- lfcShrink(ddsMFbe, contrast=c("sampleCondition", "TCGA_black.Basal", "TCGA_white.Basal"), res = diffTCGA2, type="ashr", optmethod = "mixEM")

restemp2$temp <- row.names(restemp2)
restemp2$temp <- gsub("[.].+", "", restemp2$temp)

restemp2$symbol <- mapIds(EnsDb.Hsapiens.v75,
                     keys=restemp2$temp,
                     column="SYMBOL",
                     keytype="GENEID",           
                     multiVals="first")

hist(restemp2$pvalue, main="Histogram of unadjusted p-values of differential\n gene expression between basal breast cancers \nin TCGA black and white \n breast cancer patients\nBatch corrected")

with(restemp2, plot(log2FoldChange, -log10(padj), pch=20, main="Volcano plot of differential gene expression between basal \nbreast cancers in TCGA black and \nTCGA white breast cancer patients\nBatch corrected", xlim=c(-50,50), ylim=c(0,70)))
with(subset(restemp2, padj<0.05 & (2^(abs(log2FoldChange))>50)), points(log2FoldChange, -log10(padj), pch=20, col="blue"))
with(subset(restemp2, padj<0.05 & (2^(abs(log2FoldChange))>50)), textxy(log2FoldChange, -log10(padj), labs=symbol, cex=.5))

Warning: The above code chunk cached its results, but it won’t be re-run if previous chunks it depends on are updated. If you need to use caching, it is highly recommended to also set knitr::opts_chunk$set(autodep = TRUE) at the top of the file (in a chunk that is not cached). Alternatively, you can customize the option dependson for each individual chunk that is cached. Using either autodep or dependson will remove this warning. See the knitr cache options for more details.

The histogram of unadjusted p-values for both uncorrected and SVA corrected appears anticonservative. Both volcano plots demonstrate a tendency for increased log fold change in TCGA white relative to TCGA black, but the batch corrected volcano plot shows greater dispersion / more significance in findings that is concerning for inappropriate exaggeration of effect of positive differential expression.

The following is a demonstration example of comparing 32 Nigerian basal cases and 17 TCGA white basal cases.

##sva -> Basal Nigerian-TCGA white (cross population) test

#Without batch correction
resBasal<- results(ddsMF, contrast=c("sampleCondition", "Nigerian.Basal", "TCGA_white.Basal"), alpha=0.05)
nrow(resBasal)
[1] 19745
#resBasal$foldChange <- NA
#row.pos <- which(! is.na(resBasal$log2FoldChange) & 
#                resBasal$log2FoldChange >= 0)
#row.neg <- which(! is.na(resBasal$log2FoldChange) & 
#                resBasal$log2FoldChange < 0)
#resBasal$foldChange[row.pos] <- 2^resBasal$log2FoldChange[row.pos]
#resBasal$foldChange[row.neg] <- -2^((-1) * resBasal$log2FoldChange[row.neg])
#resBasal <- data.frame(id = row.names(resBasal), resBasal)

#before <- nrow(resBasal)
#resBasal <- resBasal[!is.na(resBasal$foldChange) & ! is.na(resBasal$padj),];
#after <- nrow(resBasal)
#print(paste0('Genes removed = ', (before - after), 
#             ' (fold change is NA)'))

resBasal <- resBasal[(resBasal$log2FoldChange >= fc | resBasal$log2FoldChange <= -fc),]
resBasal <- subset(resBasal, padj < fdr)
nrow(resBasal)
[1] 7350
restemp <- NULL
restemp <- lfcShrink(ddsMF, contrast=c("sampleCondition", "Nigerian.Basal", "TCGA_white.Basal"), res = resBasal, type="ashr", optmethod = "mixEM")

restemp$temp <- row.names(restemp)
restemp$temp <- gsub("[.].+", "", restemp$temp)

restemp$symbol <- mapIds(EnsDb.Hsapiens.v75,
                     keys=restemp$temp,
                     column="SYMBOL",
                     keytype="GENEID",           
                     multiVals="first")

hist(restemp$pvalue, main="Histogram of unadjusted p-values of differential\n gene expression between basal breast cancers \nin Nigerian and \nTCGA white breast cancer patients")

with(restemp, plot(log2FoldChange, -log10(padj), pch=20, main="Volcano plot of differential gene expression between basal \nbreast cancers in Nigerian and \nTCGA white breast cancer patients", xlim=c(-50,50), ylim=c(0,70)))
with(subset(restemp, padj<0.05 & (2^(abs(log2FoldChange))>50)), points(log2FoldChange, -log10(padj), pch=20, col="blue"))
with(subset(restemp, padj<0.05 & (2^(abs(log2FoldChange))>50)), textxy(log2FoldChange, -log10(padj), labs=symbol, cex=.5))

#With batch correction
resBasal2<- results(ddsMFbe, contrast=c("sampleCondition", "Nigerian.Basal", "TCGA_white.Basal"), alpha=0.05)
nrow(resBasal2)
[1] 19745
#resBasal2$foldChange <- NA
#row.pos <- which(! is.na(resBasal2$log2FoldChange) & 
#                resBasal2$log2FoldChange >= 0)
#row.neg <- which(! is.na(resBasal2$log2FoldChange) & 
#                resBasal2$log2FoldChange < 0)
#resBasal2$foldChange[row.pos] <- 2^resBasal2$log2FoldChange[row.pos]
#resBasal2$foldChange[row.neg] <- -2^((-1) * resBasal2$log2FoldChange[row.neg])
#resBasal2 <- data.frame(id = row.names(resBasal2), resBasal2)

#before <- nrow(resBasal2)
#resBasal2 <- resBasal2[!is.na(resBasal2$foldChange) & ! is.na(resBasal2$padj),];
#after <- nrow(resBasal2)
#print(paste0('Genes removed = ', (before - after), 
#             ' (fold change is NA)'))


resBasal2 <- resBasal2[(resBasal2$log2FoldChange >= fc | resBasal2$log2FoldChange <= -fc),]
resBasal2 <- subset(resBasal2, padj < fdr)
nrow(resBasal2) #More genes are considered significant with the batch correction method
[1] 7688
restemp2 <- NULL
restemp2 <- lfcShrink(ddsMFbe, contrast=c("sampleCondition", "Nigerian.Basal", "TCGA_white.Basal"), res = resBasal2, type="ashr", optmethod = "mixEM")

restemp2$temp <- row.names(restemp2)
restemp2$temp <- gsub("[.].+", "", restemp2$temp)

restemp2$symbol <- mapIds(EnsDb.Hsapiens.v75,
                     keys=restemp2$temp,
                     column="SYMBOL",
                     keytype="GENEID",           
                     multiVals="first")

hist(restemp2$pvalue, main="Histogram of unadjusted p-values of differential\n gene expression between basal breast cancers \nin Nigerian and \nTCGA white breast cancer patients\n batch corrected")

with(restemp2, plot(log2FoldChange, -log10(padj), pch=20, main="Volcano plot of differential gene expression between basal \nbreast cancers in Nigerian and \nTCGA white breast cancer patients\n batch corrected", xlim=c(-50,50), ylim=c(0,70)))
with(subset(restemp2, padj<0.05 & (2^(abs(log2FoldChange))>50)), points(log2FoldChange, -log10(padj), pch=20, col="blue"))
with(subset(restemp2, padj<0.05 & (2^(abs(log2FoldChange))>50)), textxy(log2FoldChange, -log10(padj), labs=symbol, cex=.5))

Warning: The above code chunk cached its results, but it won’t be re-run if previous chunks it depends on are updated. If you need to use caching, it is highly recommended to also set knitr::opts_chunk$set(autodep = TRUE) at the top of the file (in a chunk that is not cached). Alternatively, you can customize the option dependson for each individual chunk that is cached. Using either autodep or dependson will remove this warning. See the knitr cache options for more details.

Although unadjusted p-value histogram appears anti-conservative again for both non batch corrected and sva corrected, volcano plot with batch correction appears to have an exaggered effect with numerous genes demonstrating >50x log fold change with significance. The volcano plot without batch correction is more consistent with expected findings, making the sva method concerning for use.

#Method 2: RUVseq of housekeeping genes Based on the paper by Krasnov et al. (https://www.frontiersin.org/articles/10.3389/fgene.2019.00097/full), we examined the housekeeping genes UBC and PUM1 across the Nigerian and TCGA samples to assess their differential expression. We also used the paper by Tilli et al. (https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-016-2946-1) to assess the efficacy of adding CCSER2, SYMPK and ANKRD17 for housekeeping comparison.

UBC: ENSG00000150991.10 PUM1: ENSG00000134644.11 CCSER2: ENSG00000107771.11 SYMPK: ENSG00000125755.14 ANKRD17: ENSG00000132466.13 -> Given the broad distribution of expression of this gene, this did not appear to be a good standard for comparison in our cohort.

resBasal<- results(ddsMF, contrast=c("sampleCondition", "Nigerian.Basal", "TCGA_white.Basal"), alpha=0.05)

genename <- "ENSG00000150991.10"
plotCounts(ddsMF, gene = genename, intgroup=c("condition1"), main="Distribution of UBC expression across groups (before DE analysis)")

genename <- "ENSG00000134644.11"
plotCounts(ddsMF, gene = genename, intgroup=c("condition1"), main="Distribution of PUM1 expression across groups (before DE analysis)")

genename <- "ENSG00000107771.11"
plotCounts(ddsMF, gene = genename, intgroup=c("condition1"), main="Distribution of CCSER2 expression across groups (before DE analysis)")

genename <- "ENSG00000125755.14"
plotCounts(ddsMF, gene = genename, intgroup=c("condition1"), main="Distribution of SYMPK expression across groups (before DE analysis)")

genename <- "ENSG00000132466.13"
plotCounts(ddsMF, gene = genename, intgroup=c("condition1"), main="Distribution of ANKRD17 expression across groups (before DE analysis)")

ubctempassay <- subset(resBasal, rownames(resBasal)=="ENSG00000150991.10")
ubctempassay[,1:6]
log2 fold change (MLE): sampleCondition Nigerian.Basal vs TCGA_white.Basal 
Wald test p-value: sampleCondition Nigerian.Basal vs TCGA_white.Basal 
DataFrame with 1 row and 6 columns
                           baseMean   log2FoldChange             lfcSE
                          <numeric>        <numeric>         <numeric>
ENSG00000150991.10 6468.18351937685 1.15934912059213 0.156290542385703
                               stat               pvalue
                          <numeric>            <numeric>
ENSG00000150991.10 7.41790963736642 1.18983229376879e-13
                                   padj
                              <numeric>
ENSG00000150991.10 3.15392280319121e-12
pum1tempassay <- subset(resBasal, rownames(resBasal)=="ENSG00000134644.11")
pum1tempassay[,1:6]
log2 fold change (MLE): sampleCondition Nigerian.Basal vs TCGA_white.Basal 
Wald test p-value: sampleCondition Nigerian.Basal vs TCGA_white.Basal 
DataFrame with 1 row and 6 columns
                           baseMean     log2FoldChange             lfcSE
                          <numeric>          <numeric>         <numeric>
ENSG00000134644.11 1636.80154138544 -0.510595789540817 0.118657334717632
                                stat             pvalue
                           <numeric>          <numeric>
ENSG00000134644.11 -4.30311190417245 1.684157315584e-05
                                   padj
                              <numeric>
ENSG00000134644.11 8.82556675117563e-05
ccser2tempassay <- subset(resBasal, rownames(resBasal)=="ENSG00000107771.11")
ccser2tempassay[,1:6]
log2 fold change (MLE): sampleCondition Nigerian.Basal vs TCGA_white.Basal 
Wald test p-value: sampleCondition Nigerian.Basal vs TCGA_white.Basal 
DataFrame with 1 row and 6 columns
                           baseMean     log2FoldChange             lfcSE
                          <numeric>          <numeric>         <numeric>
ENSG00000107771.11 632.645416027419 -0.425179623227853 0.160507936060761
                                stat              pvalue              padj
                           <numeric>           <numeric>         <numeric>
ENSG00000107771.11 -2.64896324544912 0.00807391118830482 0.020496440185416
sympktempassay <- subset(resBasal, rownames(resBasal)=="ENSG00000125755.14")
sympktempassay[,1:6]
log2 fold change (MLE): sampleCondition Nigerian.Basal vs TCGA_white.Basal 
Wald test p-value: sampleCondition Nigerian.Basal vs TCGA_white.Basal 
DataFrame with 1 row and 6 columns
                           baseMean     log2FoldChange             lfcSE
                          <numeric>          <numeric>         <numeric>
ENSG00000125755.14 1584.27594103114 0.0985301368863784 0.218437517304813
                               stat            pvalue              padj
                          <numeric>         <numeric>         <numeric>
ENSG00000125755.14 0.45106782984027 0.651940662787615 0.745496730733869

UBC and PUM1 (as well as CCSER2) demonstrate signifcantly different distribution between the Nigerian and TCGA samples (although the log2 fold change is less for the PUM1 gene). SYMPK does not have significantly different distribution. This suggests that normalization of the housekeeping genes can be used as a means of comparing the RNA read counts.

genenames <- c("ENSG00000150991.10", "ENSG00000134644.11", "ENSG00000107771.11", "ENSG00000125755.14")

set1 <- RUVg(counts(ddsHTSeqMF), genenames, k=1)

plotRLE(counts(ddsHTSeqMF), outline=FALSE, ylim=c(-4, 4))

plotRLE(set1$normalizedCounts, outline=FALSE, ylim=c(-4, 4))

set1pca <- as.data.frame(set1$normalizedCounts)
t_set1pca <- t(set1pca)
t_set1pca <- as.data.frame(t_set1pca)

t_set1pca <- cbind (t_set1pca, sampleTable2) #This binds the characteristics of the original patients to the quantile normalized counts. CBinding was checked to make sure that patients were correctly aligned to characteristics. 

ruv.pca <- prcomp(t_set1pca[,1:19745])
autoplot(ruv.pca, data=t_set1pca, colour='sampleCondition', main="PCA of RUV normalization results prior to DE analysis", ylim=c(-0.05, 0.05))

sampleTableRUV <- data.frame(sampleName=gsub(".htseq.counts","",sampleFiles),
                          fileName=sampleFiles,
                          condition1=sampleConditionrace,
                          condition2=sampleConditionPAM50,
                          W_1 = set1$W)

sampleTableRUV$sampleCondition <- paste(sampleTableRUV$condition1, sampleTableRUV$condition2, sep=".")

ddsHTSeqMFRUV <- DESeqDataSetFromHTSeqCount(sampleTable=sampleTableRUV,
                                       directory=FOLDER,
                                       design= ~W_1+sampleCondition)

ddsHTSeqMFRUV <- ddsHTSeqMFRUV[rowSums(counts(ddsHTSeqMFRUV)) > 0, ]

ddsHTSeqMFRUV$sampleCondition <- relevel(ddsHTSeqMFRUV$sampleCondition, ref = "Nigerian.Basal") #explicitly making the Nigerian Basal patients the reference population

ddsMFRUV <- DESeq(ddsHTSeqMFRUV) #Running differential expression with the two housekeeping genes for normalization

The RUV normalization demonstrates a different PCA with some clustering based on Nigerian vs. TCGA and some loss of clustering based on subtype relative to VSD normalization or SVA batch effect correction.

##RUVseq -> differential expression analysis overall results

cat("MA Plot: RUVSeq-corrected differential expression")
MA Plot: RUVSeq-corrected differential expression
res2 <- lfcShrink(ddsMFRUV, coef="sampleCondition_TCGA_white.Basal_vs_Nigerian.Basal", type="ashr" , optmethod = "mixEM")
DESeq2::plotMA(res2, ylim=c(-10,10), xlim=c(0.1,200))

Again, we see an overrepresentation of significant negative log fold changes of expression in the Nigerian patients relative to the TCGA white patients that is exacerbated by the RUV correction.

##RUVseq -> Inter-TCGA test

#With RUV correction
diffTCGA3<- results(ddsMFRUV, contrast=c("sampleCondition", "TCGA_black.Basal", "TCGA_white.Basal"), alpha=0.05)
nrow(diffTCGA3)
[1] 19745
diffTCGA3 <- diffTCGA3[(diffTCGA3$log2FoldChange >= fc | diffTCGA3$log2FoldChange <= -fc),] 
diffTCGA3 <- subset(diffTCGA3, padj < fdr)
nrow(diffTCGA3)
[1] 347
restemp3 <- NULL
restemp3 <- lfcShrink(ddsMFRUV, contrast=c("sampleCondition", "TCGA_black.Basal", "TCGA_white.Basal"), res = diffTCGA3, type="ashr", optmethod = "mixEM")

restemp3$temp <- row.names(restemp3)
restemp3$temp <- gsub("[.].+", "", restemp3$temp)

restemp3$symbol <- mapIds(EnsDb.Hsapiens.v75,
                     keys=restemp3$temp,
                     column="SYMBOL",
                     keytype="GENEID",           
                     multiVals="first")

hist(restemp3$pvalue, main="Histogram of unadjusted p-values of differential\n gene expression between basal breast cancers \nin TCGA black and white \n breast cancer patients\nRUV corrected")

with(restemp3, plot(log2FoldChange, -log10(padj), pch=20, main="Volcano plot of differential gene expression between basal \nbreast cancers in TCGA black and \nTCGA white breast cancer patients\nRUV corrected", xlim=c(-50,50), ylim=c(0,70)))
with(subset(restemp3, padj<0.05 & (2^(abs(log2FoldChange))>50)), points(log2FoldChange, -log10(padj), pch=20, col="blue"))
with(subset(restemp3, padj<0.05 & (2^(abs(log2FoldChange))>50)), textxy(log2FoldChange, -log10(padj), labs=symbol, cex=.5))

Warning: The above code chunk cached its results, but it won’t be re-run if previous chunks it depends on are updated. If you need to use caching, it is highly recommended to also set knitr::opts_chunk$set(autodep = TRUE) at the top of the file (in a chunk that is not cached). Alternatively, you can customize the option dependson for each individual chunk that is cached. Using either autodep or dependson will remove this warning. See the knitr cache options for more details.

Again, we see an overrepresentation of differentially expressed genes on the positive side relative to the negative, concerning for the same issue as seen in SVA batch correction.

##RUVSeq -> Basal Nigerian-TCGA white and TCGA black (cross population) tests

#With RUV correction
resBasal3<- results(ddsMFRUV, contrast=c("sampleCondition", "Nigerian.Basal", "TCGA_white.Basal"), alpha=0.05)
nrow(resBasal3)
[1] 19745
resBasal3 <- resBasal3[(resBasal3$log2FoldChange >= fc | resBasal3$log2FoldChange <= -fc),]
resBasal3 <- subset(resBasal3, padj < fdr)
nrow(resBasal3)
[1] 8777
restemp3 <- NULL
restemp3 <- lfcShrink(ddsMFRUV, contrast=c("sampleCondition", "Nigerian.Basal", "TCGA_white.Basal"), res = resBasal3, type="ashr", optmethod = "mixEM")

restemp3$temp <- row.names(restemp3)
restemp3$temp <- gsub("[.].+", "", restemp3$temp)

restemp3$symbol <- mapIds(EnsDb.Hsapiens.v75,
                     keys=restemp3$temp,
                     column="SYMBOL",
                     keytype="GENEID",           
                     multiVals="first")

hist(restemp3$pvalue, main="Histogram of unadjusted p-values of differential\n gene expression between basal breast cancers \nin Nigerian and \nTCGA white breast cancer patients\n RUV corrected")

with(restemp3, plot(log2FoldChange, -log10(padj), pch=20, main="Volcano plot of differential gene expression between basal \nbreast cancers in Nigerian and \nTCGA white breast cancer patients\nRUV corrected", xlim=c(-50,50), ylim=c(0,70)))
with(subset(restemp3, padj<0.05 & (2^(abs(log2FoldChange))>50)), points(log2FoldChange, -log10(padj), pch=20, col="blue"))
with(subset(restemp3, padj<0.05 & (2^(abs(log2FoldChange))>50)), textxy(log2FoldChange, -log10(padj), labs=symbol, cex=.5))

resBasal4<- results(ddsMFRUV, contrast=c("sampleCondition", "Nigerian.Basal", "TCGA_black.Basal"), alpha=0.05)
nrow(resBasal4)
[1] 19745
resBasal4 <- resBasal4[(resBasal4$log2FoldChange >= fc | resBasal4$log2FoldChange <= -fc),]
resBasal4 <- subset(resBasal4, padj < fdr)
nrow(resBasal4)
[1] 7838
restemp4 <- NULL
restemp4 <- lfcShrink(ddsMFRUV, contrast=c("sampleCondition", "Nigerian.Basal", "TCGA_black.Basal"), res = resBasal4, type="ashr", optmethod = "mixEM")

restemp4$temp <- row.names(restemp4)
restemp4$temp <- gsub("[.].+", "", restemp4$temp)

restemp4$symbol <- mapIds(EnsDb.Hsapiens.v75,
                     keys=restemp4$temp,
                     column="SYMBOL",
                     keytype="GENEID",           
                     multiVals="first")

hist(restemp4$pvalue, main="Histogram of unadjusted p-values of differential\n gene expression between basal breast cancers \nin Nigerian and \nTCGA black breast cancer patients\n RUV corrected")

with(restemp4, plot(log2FoldChange, -log10(padj), pch=20, main="Volcano plot of differential gene expression between basal \nbreast cancers in Nigerian and \nTCGA black breast cancer patients\nRUV corrected", xlim=c(-50,50), ylim=c(0,70)))
with(subset(restemp4, padj<0.05 & (2^(abs(log2FoldChange))>50)), points(log2FoldChange, -log10(padj), pch=20, col="blue"))
with(subset(restemp4, padj<0.05 & (2^(abs(log2FoldChange))>50)), textxy(log2FoldChange, -log10(padj), labs=symbol, cex=.5))

Warning: The above code chunk cached its results, but it won’t be re-run if previous chunks it depends on are updated. If you need to use caching, it is highly recommended to also set knitr::opts_chunk$set(autodep = TRUE) at the top of the file (in a chunk that is not cached). Alternatively, you can customize the option dependson for each individual chunk that is cached. Using either autodep or dependson will remove this warning. See the knitr cache options for more details.

Although unadjusted p-value histogram appears anti-conservative again for RUV data, volcano plot with RUV correction appears to have the same exaggered effect as seen with the SVA batch effect.

There is some reassurance that both methods are finding genes in common.

#Method 3: Quantile normalization Quantile normalization is another technique that can be used to make distributions comparable. To quantile normalize two or more distributions to each other, without a reference distribution, one must sort the values within the distribution, then set to the average (usually, arithmetic mean) of the distributions. So the highest value in all cases becomes the mean of the highest values, the second highest value becomes the mean of the second highest values, and so on.

countmatrix <- assay(ddsHTSeqMF) #Raw counts organized into matrix format from individual files
countmatrix2 <- log2(countmatrix + 1) #Basic transformation of the count data 
countmatrix2 = countmatrix2 [rowMeans(countmatrix2) > 0, ] #Remove if there is no data re: counts

plot(density(countmatrix2[,1]),lwd=3,ylim=c(0,.30), main="Density of counts with log2[count]+1 transformation ONLY") 
for(i in 1:185){lines(density(countmatrix2[,i]),lwd=3)} #This demonstrates that there is a difference in distributions between the Nigerian and TCGA data with basic log transformation normalization 

norm_countmatrix <- as.matrix(countmatrix2) 
norm_countmatrix = normalize.quantiles(norm_countmatrix)
plot(density(norm_countmatrix[,1]),lwd=3,ylim=c(0,.3), main="Density of counts with quantile normalization")
for(i in 1:184){lines(density(norm_countmatrix[,i]),lwd=3)} #This demonstrates the effect of comparative quantile normalization

colnames (norm_countmatrix) <- colnames (countmatrix2)
rownames (norm_countmatrix) <- rownames (countmatrix2)

norm_countmatrix <- as.data.frame(norm_countmatrix)
countmatrixNigerian <- dplyr::select(norm_countmatrix, contains("LIB"))
plot(density(countmatrixNigerian[,1]),lwd=3,ylim=c(0,.3), main="Density of counts with quantile normalization - Nigerian")
for(i in 1:98){lines(density(countmatrixNigerian[,i]),lwd=3)} #This demonstrates the result of the normalized Nigerian counts separately

tcgacolnames <- colnames(countmatrix)
tcgacolnames <- setdiff(tcgacolnames, colnames(countmatrixNigerian))
countmatrixTCGA <- norm_countmatrix[ , tcgacolnames]
plot(density(countmatrixTCGA[,1]),lwd=3,ylim=c(0,.3), main="Density of counts with quantile normalization - TCGA")
for(i in 1:85){lines(density(countmatrixTCGA[,i]),lwd=3)} #This demonstrates the result of the normalized TCGA counts separately

norm_countmatrix <- as.data.frame(norm_countmatrix)
t_norm_countmatrix <- t(norm_countmatrix)

t_norm_countmatrix <- cbind (t_norm_countmatrix, sampleTable2) #This binds the characteristics of the original patients to the quantile normalized counts. CBinding was checked to make sure that patients were correctly aligned to characteristics. 

quant.pca <- prcomp(t_norm_countmatrix[,1:19745])
autoplot(quant.pca, data=t_norm_countmatrix, colour='sampleCondition', main="PCA of quantile normalization results prior to DE analysis")

The quantile normalization demonstrates a PCA that has similar clustering and % explanations relative to VSD normalization, but the differences between groups have been narrowed.

##Quantile normalization -> DE analysis and Inter-TCGA test

design <- t_norm_countmatrix
design <- design %>% dplyr::select(sampleCondition)

design$sampleCondition <- ifelse (design$sampleCondition=="TCGA_black.Basal", 0, as.character(design$sampleCondition))
design$sampleCondition <- ifelse (design$sampleCondition=="TCGA_white.Basal", 1, as.character(design$sampleCondition))

design$sampleCondition <- ifelse (design$sampleCondition==0 | design$sampleCondition==1, design$sampleCondition, NA)

design <- design %>% subset(is.na(sampleCondition)==FALSE)

design$TCGA_black.Basal <- ifelse (design$sampleCondition==0, 1, 0)
design$TCGA_white.Basal <- ifelse (design$sampleCondition==1, 1, 0)

design$sampleCondition <- NULL

quantids <- rownames(design)

design <- model.matrix(~0+design$TCGA_black.Basal+design$TCGA_white.Basal)
rownames(design) <- quantids
colnames(design) <- c("TCGA_black", "TCGA_white")

quantdata <- t(counts(ddsHTSeqMF))
quantdata <- as.data.frame(quantdata)
quantdata <- quantdata[quantids,]
quantdata <- t(quantdata)
quantdata <- quantdata[rowSums(quantdata) > 0, ] 

contr.matrix <- makeContrasts(TCGAblackvsTCGAwhite = TCGA_black-TCGA_white, levels=colnames(design))

dge <- DGEList(counts=quantdata)
dge <- calcNormFactors(dge)

v=voom(dge,design,plot=T, normalize="quantile")

fit <- lmFit(v, design)
fit <- contrasts.fit(fit, contrasts=contr.matrix)
fit <- eBayes(fit)
dt <- decideTests(fit)
summary(dt)
       TCGAblackvsTCGAwhite
Down                      0
NotSig                19186
Up                        0
hist(fit$p.value, ylim=c(0,3000), main="Histogram of unadjusted p-values of differential\n gene expression between basal breast cancers \nin TCGA black and \nTCGA white breast cancer patients\n quantile corrected")

qvals<-p.adjust(fit$p.value[,1], method='fdr')

df_limma <- data_frame(log2FoldChange = fit$coefficients[,1], 
                       pval = fit$p.value[,1],
                       padj = qvals,
                       temp = gsub("[.].+", "", rownames(fit)))

df_limma$symbol <- mapIds(EnsDb.Hsapiens.v75,
                     keys=df_limma$temp,
                     column="SYMBOL",
                     keytype="GENEID",        
                     multiVals="first")

with(df_limma, plot(log2FoldChange, -log10(padj), pch=20, main="Volcano plot of differential gene expression between basal \nbreast cancers in TCGA black and \nTCGA white breast cancer patients\nquantile corrected", xlim=c(-50,50), ylim=c(0,70)))
with(subset(df_limma, padj<0.05 & (2^(abs(log2FoldChange))>50)), points(log2FoldChange, -log10(padj), pch=20, col="blue"))

with(subset(df_limma, padj<0.05 & (2^(abs(log2FoldChange))>50)), textxy(log2FoldChange, -log10(padj), labs=symbol, cex=.5))

It has been previously demonstrated that there are expression difference between ethnic groups within the TCGA cohort, and this technique appears to drastically minimize this effect.

##Quantile normalization -> Basal Nigerian-TCGA white (cross population) test

design <- t_norm_countmatrix
design <- design %>% dplyr::select(sampleCondition)

design$sampleCondition <- ifelse (design$sampleCondition=="Nigerian.Basal", 0, as.character(design$sampleCondition))
design$sampleCondition <- ifelse (design$sampleCondition=="TCGA_white.Basal", 1, as.character(design$sampleCondition))

design$sampleCondition <- ifelse (design$sampleCondition==0 | design$sampleCondition==1, design$sampleCondition, NA)

design <- design %>% subset(is.na(sampleCondition)==FALSE)

design$Nigerian.Basal <- ifelse (design$sampleCondition==0, 1, 0)
design$TCGA_white.Basal <- ifelse (design$sampleCondition==1, 1, 0)

design$sampleCondition <- NULL

quantids <- rownames(design)

design <- model.matrix(~0+design$Nigerian.Basal+design$TCGA_white.Basal)
rownames(design) <- quantids
colnames(design) <- c("Nigerian", "TCGA_white")

quantdata <- t(counts(ddsHTSeqMF))
quantdata <- as.data.frame(quantdata)
quantdata <- quantdata[quantids,]
quantdata <- t(quantdata)
quantdata <- quantdata[rowSums(quantdata) > 0, ] 

contr.matrix <- makeContrasts( NigerianvsTCGA = Nigerian-TCGA_white, levels=colnames(design))

dge <- DGEList(counts=quantdata)
dge <- calcNormFactors(dge)

v=voom(dge,design,plot=T, normalize="quantile")

fit <- lmFit(v, design)
fit <- contrasts.fit(fit, contrasts=contr.matrix)
fit <- eBayes(fit)
dt <- decideTests(fit)
summary(dt)
       NigerianvsTCGA
Down             2894
NotSig          13944
Up               2550
hist(fit$p.value, ylim=c(0,3000), main="Histogram of unadjusted p-values of differential\n gene expression between basal breast cancers \nin Nigerian and \nTCGA white breast cancer patients\n quantile corrected")

qvals<-p.adjust(fit$p.value[,1], method='fdr')

df_limma <- data_frame(log2FoldChange = fit$coefficients[,1], 
                       pval = fit$p.value[,1],
                       padj = qvals,
                       temp = gsub("[.].+", "", rownames(fit)))

df_limma$symbol <- mapIds(EnsDb.Hsapiens.v75,
                     keys=df_limma$temp,
                     column="SYMBOL",
                     keytype="GENEID",        
                     multiVals="first")

with(df_limma, plot(log2FoldChange, -log10(padj), pch=20, main="Volcano plot of differential gene expression between basal \nbreast cancers in Nigerian and \nTCGA white breast cancer patients\nquantile corrected", xlim=c(-50,50), ylim=c(0,70)))
with(subset(df_limma, padj<0.05 & (2^(abs(log2FoldChange))>50)), points(log2FoldChange, -log10(padj), pch=20, col="blue"))
with(subset(df_limma, padj<0.05 & (2^(abs(log2FoldChange))>50)), textxy(log2FoldChange, -log10(padj), labs=symbol, cex=.5))

As per Wei Shi’s preferred method: Counts were converted to log2 counts per million, quantile normalized and precision weighted with the ‘voom’ function of the limma package46,47. A linear model was fitted to each gene, and empirical Bayes moderated t-statistics were used to assess differences in expression48. Empirical Bayes moderated-t P values were computed relative to a fold-change cutoff of 1.5-fold. P values were adjusted to control the global FDR across all comparisons with the ‘global’ option of the limma package. Genes were considered differentially expressed if they had an FDR of 0.05.

##Limma’s removeBatchEffect for visualization only

mat <- assay(vsd) 
mat <- limma::removeBatchEffect(mat, vsd$condition1)
assay(vsd) <- mat
plotPCA(vsd, intgroup=c("condition1", "condition2"))

#Heatmap to assess clusters post-batch effect removal
topVarGenes <- head(order(rowVars(assay(vsd)), decreasing = TRUE), 250)
hm  <- assay(vsd)[ topVarGenes, ]
hm  <- hm - rowMeans(hm)
anno <- as.data.frame(colData(vsd)[, c("condition1", "condition2")])
colnames(anno) <- c("Ethnicity", "PAM50-subtype")
rownames(anno) <- colnames(vsd)
pheatmap(hm, annotation_col = anno, cluster_rows=TRUE, cluster_cols=TRUE, clustering_distance_rows = "correlation", clustering_distance_cols = "correlation", clustering_method = "complete", show_rownames=FALSE, show_colnames = FALSE, main="Heat map: Relative VST-transformed\ncounts across samples after Limma:removeBatchEffect")

This heatmap after batch correction with limma’s removeBatchEffect method is using the top 250 genes and is competely heirarchically clustered. These clusters are difficult to separate from each other, and it is not clear that the Limma removeBatchEffect method does not overcompensate and competely flatten existing biological signal.

In addition, the voom method is considered more appropriate for data with significant noise / abnormalities.

#Internal data validity assessments We will test the internal validity of using DESeq2 using comparisons between subtypes, comparisons within the Nigerian cohort. This will not use the batch effect corrected approach with SVA given the demonstrations we showed earlier of the inappropriate exaggeration of positive findings.

##Inter-subtype gene expression testing

rescrosstype<- results(ddsMF, contrast=c("sampleCondition", "Nigerian.Basal", "Nigerian.Her2"), alpha=0.05)
nrow(rescrosstype)
[1] 19745
#resBasal2$foldChange <- NA
#row.pos <- which(! is.na(resBasal2$log2FoldChange) & 
#                resBasal2$log2FoldChange >= 0)
#row.neg <- which(! is.na(resBasal2$log2FoldChange) & 
#                resBasal2$log2FoldChange < 0)
#resBasal2$foldChange[row.pos] <- 2^resBasal2$log2FoldChange[row.pos]
#resBasal2$foldChange[row.neg] <- -2^((-1) * resBasal2$log2FoldChange[row.neg])
#resBasal2 <- data.frame(id = row.names(resBasal2), resBasal2)

#before <- nrow(resBasal2)
#resBasal2 <- resBasal2[!is.na(resBasal2$foldChange) & ! is.na(resBasal2$padj),];
#after <- nrow(resBasal2)
#print(paste0('Genes removed = ', (before - after), 
#             ' (fold change is NA)'))


rescrosstype <- rescrosstype[(rescrosstype$log2FoldChange >= fc | rescrosstype$log2FoldChange <= -fc),]
rescrosstype <- subset(rescrosstype, padj < fdr)
nrow(rescrosstype)
[1] 2637
restempcheck <- NULL
restempcheck <- lfcShrink(ddsMF, contrast=c("sampleCondition", "Nigerian.Basal", "Nigerian.Her2"), res = rescrosstype, type="ashr", optmethod = "mixEM")

restempcheck$temp <- row.names(restempcheck)
restempcheck$temp <- gsub("[.].+", "", restempcheck$temp)

restempcheck$symbol <- mapIds(EnsDb.Hsapiens.v75,
                     keys=restempcheck$temp,
                     column="SYMBOL",
                     keytype="GENEID",           
                     multiVals="first")

hist(restempcheck$pvalue, main="Histogram of unadjusted p-values of differential\n gene expression between \n Nigerian basal and Nigerian Her2 breast cancer patients")

with(restempcheck, plot(log2FoldChange, -log10(padj), pch=20, main="Volcano plot of differential\n gene expression between \n Nigerian basal and Nigerian Her2 breast cancer patients", xlim=c(-50,50), ylim=c(0,70)))
with(subset(restempcheck, padj<0.01 & abs(log2FoldChange)>25), points(log2FoldChange, -log10(pvalue), pch=20, col="blue"))

with(subset(restempcheck, padj<0.01 & abs(log2FoldChange)>25), textxy(log2FoldChange, -log10(padj), labs=symbol, cex=.5))

Warning: The above code chunk cached its results, but it won’t be re-run if previous chunks it depends on are updated. If you need to use caching, it is highly recommended to also set knitr::opts_chunk$set(autodep = TRUE) at the top of the file (in a chunk that is not cached). Alternatively, you can customize the option dependson for each individual chunk that is cached. Using either autodep or dependson will remove this warning. See the knitr cache options for more details.

The unadjusted p-value histogram and volcano plot appear appropriate without batch correction in the cross-subtype test. It is reassuring that we are seeing more normal volcano plots on the within-assay comparison.

genename <- "ENSG00000141736.9"
her2comparematrix <- plotCounts(ddsMF, gene = genename, intgroup=c("condition1", "condition2"), main="Distribution of HER2 expression across groups", returnData = TRUE)

her2comparematrix <- her2comparematrix %>% subset(condition1=="Nigerian" & (condition2=="Her2" | condition2=="Basal"))
ggplot(her2comparematrix, aes (x=condition2, y=count)) + geom_point() + labs(title="Comparison of normalized RNAseq counts of HER2 expression between\nNigerian basal tumors and Nigerian HER2 tumors") + xlab("Breast cancer Subtype") + ylab("Normalized RNAseq count (pre DE)") + scale_y_log10()

her2tempassay <- subset(rescrosstype, rownames(rescrosstype)=="ENSG00000141736.9")
her2tempassay[,1:6]
log2 fold change (MLE): sampleCondition Nigerian.Basal vs Nigerian.Her2 
Wald test p-value: sampleCondition Nigerian.Basal vs Nigerian.Her2 
DataFrame with 1 row and 6 columns
                          baseMean    log2FoldChange             lfcSE
                         <numeric>         <numeric>         <numeric>
ENSG00000141736.9 23219.1675125931 -4.79084247675308 0.339508733140403
                               stat               pvalue
                          <numeric>            <numeric>
ENSG00000141736.9 -14.1111023343598 3.24466116268002e-45
                                  padj
                             <numeric>
ENSG00000141736.9 2.88791066784335e-41
tempassay <- as.data.frame(assay(ddsHTSeqMF))
tempassay <- dplyr::select(tempassay, contains("LIB"))
tempassay <- dplyr::select(tempassay, matches('Basal|Her2'))
tempassay <- subset(tempassay, rownames(tempassay)=="ENSG00000141736.9")
tempassaybasal <- dplyr::select(tempassay, matches('Basal'))
tempassayher2 <- dplyr::select(tempassay, matches('Her2'))

tempassaybasal <-as.numeric(tempassaybasal)
tempassayher2 <-as.numeric(tempassayher2)
t.test (tempassaybasal, tempassayher2)

    Welch Two Sample t-test

data:  tempassaybasal and tempassayher2
t = -4.5324, df = 25.048, p-value = 0.0001247
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -316152.6 -118614.2
sample estimates:
 mean of x  mean of y 
  8025.375 225408.769 

The cross subtype comparison did find a statistically significant difference between HER2 expression between the Nigerian basal and Her tumors, which is appropriate when the data is visualized as well.

We will now assess arbitrary within-Nigerian samples to identify any other within sample sources of bias.

##Arbitrary grouping of Nigerian patients cross-checked with each other

table(batchval, sampleConditionPAM50)
         sampleConditionPAM50
batchval  Basal Her2 LumA LumB Normal PAM_other
  batch1      8    7    1    3      2         0
  batch23     9   12    5    8      2         0
  batch4      6    2    5    1      1         0
  batch5      9    5    5    6      2         0
  batchT     42    5   12   13      0        14
sampletableNigerian <- sampleTable2
sampletableNigerian <- subset(sampletableNigerian, batch=="batch23" | batch=="batch5")
sampletableNigerian$condition1 <- NULL

dsHTSeqMFNigerian <- DESeqDataSetFromHTSeqCount(sampleTable=sampletableNigerian,
                                       directory=FOLDER,
                                       design=~batch)
dsHTSeqMFNigerian <- dsHTSeqMFNigerian[rowSums(counts(dsHTSeqMFNigerian)) > 0, ] #Pre-filtering the dataset by removing the rows without information about gene expression
ddsMFNigerianexample <- DESeq(dsHTSeqMFNigerian)

resNigerianNigerian<- results(ddsMFNigerianexample, contrast=c("batch", "batch23", "batch5"), alpha=0.05)

resNigerianNigerian <- resNigerianNigerian[(resNigerianNigerian$log2FoldChange >= fc | resNigerianNigerian$log2FoldChange <= -fc),]
resNigerianNigerian <- subset(resNigerianNigerian, padj < fdr)
nrow(resNigerianNigerian) #More genes are considered significant with the batch correction method
[1] 2262
restempcheck <- NULL
restempcheck <- lfcShrink(ddsMFNigerianexample, contrast=c("batch", "batch23", "batch5"), res = resNigerianNigerian, type="ashr", optmethod = "mixEM")

restempcheck$temp <- row.names(restempcheck)
restempcheck$temp <- gsub("[.].+", "", restempcheck$temp)

restempcheck$symbol <- mapIds(EnsDb.Hsapiens.v75,
                     keys=restempcheck$temp,
                     column="SYMBOL",
                     keytype="GENEID",           
                     multiVals="first")

hist(restempcheck$pvalue, main="Histogram of unadjusted p-values of differential\n gene expression between \n grouped Nigerian breast cancer patients")

with(restempcheck, plot(log2FoldChange, -log10(padj), pch=20, main="Volcano plot of differential\n gene expression between \n grouped Nigerian breast cancer patients", xlim=c(-50,50), ylim=c(0,70)))
with(subset(restempcheck, padj<0.05 & (2^(abs(log2FoldChange))>50)), points(log2FoldChange, -log10(padj), pch=20, col="blue"))
with(subset(restempcheck, padj<0.05 & (2^(abs(log2FoldChange))>50)), textxy(log2FoldChange, -log10(padj), labs=symbol, cex=.5))


sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
 [1] parallel  stats4    grid      stats     graphics  grDevices utils    
 [8] datasets  methods   base     

other attached packages:
 [1] Glimma_1.12.0               RColorBrewer_1.1-2         
 [3] preprocessCore_1.46.0       ashr_2.2-32                
 [5] ggfortify_0.4.7             calibrate_1.7.2            
 [7] MASS_7.3-51.4               sva_3.32.1                 
 [9] mgcv_1.8-28                 nlme_3.1-140               
[11] EnsDb.Hsapiens.v75_2.99.0   ensembldb_2.8.0            
[13] AnnotationFilter_1.8.0      GenomicFeatures_1.36.1     
[15] hexbin_1.27.3               stringi_1.4.3              
[17] dplyr_0.8.1                 affy_1.62.0                
[19] checkmate_1.9.3             pathview_1.24.0            
[21] org.Hs.eg.db_3.8.2          AnnotationDbi_1.46.0       
[23] clusterProfiler_3.12.0      pheatmap_1.0.12            
[25] genefilter_1.66.0           vsn_3.52.0                 
[27] RUVSeq_1.18.0               EDASeq_2.18.0              
[29] ShortRead_1.42.0            GenomicAlignments_1.20.0   
[31] Rsamtools_2.0.0             Biostrings_2.52.0          
[33] XVector_0.24.0              DESeq2_1.24.0              
[35] SummarizedExperiment_1.14.0 DelayedArray_0.10.0        
[37] BiocParallel_1.18.0         matrixStats_0.54.0         
[39] Biobase_2.44.0              GenomicRanges_1.36.0       
[41] GenomeInfoDb_1.20.0         IRanges_2.18.1             
[43] S4Vectors_0.22.0            BiocGenerics_0.30.0        
[45] edgeR_3.26.4                limma_3.40.2               
[47] ggbiplot_0.55               scales_1.0.0               
[49] plyr_1.8.4                  ggplot2_3.1.1              
[51] gplots_3.0.1.1             

loaded via a namespace (and not attached):
  [1] R.utils_2.8.0          tidyselect_0.2.5       RSQLite_2.1.1         
  [4] htmlwidgets_1.3        DESeq_1.36.0           munsell_0.5.0         
  [7] codetools_0.2-16       withr_2.1.2            colorspace_1.4-1      
 [10] GOSemSim_2.10.0        knitr_1.23             rstudioapi_0.10       
 [13] pscl_1.5.2             DOSE_3.10.1            labeling_0.3          
 [16] git2r_0.25.2           KEGGgraph_1.44.0       urltools_1.7.3        
 [19] GenomeInfoDbData_1.2.1 mixsqp_0.1-97          hwriter_1.3.2         
 [22] polyclip_1.10-0        bit64_0.9-7            farver_1.1.0          
 [25] rprojroot_1.3-2        xfun_0.7               doParallel_1.0.14     
 [28] R6_2.4.0               locfit_1.5-9.1         bitops_1.0-6          
 [31] fgsea_1.10.0           gridGraphics_0.4-1     assertthat_0.2.1      
 [34] ggraph_1.0.2           nnet_7.3-12            enrichplot_1.4.0      
 [37] gtable_0.3.0           workflowr_1.4.0        rlang_0.3.4           
 [40] splines_3.6.0          rtracklayer_1.44.0     lazyeval_0.2.2        
 [43] acepack_1.4.1          europepmc_0.3          BiocManager_1.30.4    
 [46] yaml_2.2.0             reshape2_1.4.3         backports_1.1.4       
 [49] qvalue_2.16.0          Hmisc_4.2-0            tools_3.6.0           
 [52] ggplotify_0.0.3        affyio_1.54.0          ggridges_0.5.1        
 [55] Rcpp_1.0.1             base64enc_0.1-3        progress_1.2.2        
 [58] zlibbioc_1.30.0        purrr_0.3.2            RCurl_1.95-4.12       
 [61] prettyunits_1.0.2      rpart_4.1-15           viridis_0.5.1         
 [64] cowplot_0.9.4          ggrepel_0.8.1          cluster_2.0.9         
 [67] fs_1.3.1               magrittr_1.5           data.table_1.12.2     
 [70] DO.db_2.9              triebeard_0.3.0        truncnorm_1.0-8       
 [73] SQUAREM_2017.10-1      ProtGenerics_1.16.0    aroma.light_3.14.0    
 [76] hms_0.4.2              evaluate_0.14          xtable_1.8-4          
 [79] XML_3.98-1.20          gridExtra_2.3          compiler_3.6.0        
 [82] biomaRt_2.40.0         tibble_2.1.3           KernSmooth_2.23-15    
 [85] crayon_1.3.4           R.oo_1.22.0            htmltools_0.3.6       
 [88] Formula_1.2-3          tidyr_0.8.3            geneplotter_1.62.0    
 [91] DBI_1.0.0              tweenr_1.0.1           Matrix_1.2-17         
 [94] R.methodsS3_1.7.1      gdata_2.18.0           igraph_1.2.4.1        
 [97] pkgconfig_2.0.2        rvcheck_0.1.3          foreign_0.8-71        
[100] foreach_1.4.4          xml2_1.2.0             annotate_1.62.0       
[103] stringr_1.4.0          digest_0.6.19          graph_1.62.0          
[106] rmarkdown_1.13         fastmatch_1.1-0        htmlTable_1.13.1      
[109] curl_3.3               gtools_3.8.1           jsonlite_1.6          
[112] viridisLite_0.3.0      pillar_1.4.1           lattice_0.20-38       
[115] KEGGREST_1.24.0        httr_1.4.0             survival_2.44-1.1     
[118] GO.db_3.8.2            glue_1.3.1             UpSetR_1.4.0          
[121] iterators_1.0.10       png_0.1-7              bit_1.1-14            
[124] Rgraphviz_2.28.0       ggforce_0.2.2          blob_1.1.1            
[127] latticeExtra_0.6-28    caTools_1.17.1.2       memoise_1.1.0